CommunityOverCode Asia Track Introduction: Data Storage and Computing

ALC Beijing

2024-09-12

Preface

CommunityOverCode Asia 2023

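Advances in new-generation information technologies such as cloud computing, the Internet of Things, artificial intelligence, and 5G have rapidly increased network capacity, further driven the adoption of cloud-edge-device computing environments, and accelerated the deep integration of information technology with traditional industries. With data volumes growing explosively, using resources flexibly and efficiently while meeting high-throughput, low-latency storage requirements has become a key issue for enterprises undergoing digital transformation. Meanwhile, as high-performance computing continues to develop, the need to retain massive amounts of data over the long term pushes storage costs even higher, so improving storage utilization and reducing storage costs are also questions enterprises urgently need to address.

The Data Storage and Computing track of CommunityOverCode Asia 2023 (formerly ApacheCon Asia) brings you the latest news from the related Apache projects. Let's take a look!

Track Chair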

Li Gang (李岗)

Apache Software Foundation Member and Apache IPMC Member/Mentor, Apache DolphinScheduler initial committer and PMC Member, and Apache Local Community (ALC) Beijing Member. Currently a senior data architect at Lenovo Group.

About the Track

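Big data is an important branch of computer science, and research and innovation in big data storage and computing have never stopped. Big data is profoundly leading and reshaping every industry and has become inseparable from our daily lives.

Big data is also a very important part of the ASF, which hosts many projects in the big data storage and computing space, such as the well-known Apache Hadoop, Apache Spark, Apache HBase, Apache Ozone, Apache CarbonData, Apache Cassandra, Apache ZooKeeper, and Apache Celeborn (Incubating). In this track, you will learn about the cutting-edge trends in these technologies, along with hands-on experience from front-line users and analyses of principles and architecture.

Agenda Highlights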

August 18, 13:30 - 17:15

Talk: What's new in the recent and upcoming HBase releases

Time: August 18, 13:30 - 14:00

Abstract:

Apache HBase™ is the Hadoop database, a distributed, scalable, big data store. The HBase community is preparing a new major release, 3.0.0, and a new minor release, 2.6.0, with some brand-new features.

In this presentation, we will introduce these new features, how they benefit our users, and how we implement them in HBase:

1. Tracing Improvements: OpenTelemetry integration;

2. TLS Support: secure and encrypted RPC communication;

3. Cloud Native Support: better OSS support, k8s deployment, etc.;

4. Other Notable Improvements: HBase on Ozone, new region replication framework, etc.

Additionally, we will delve into our plans for the future and discuss the exciting directions in which HBase is heading.

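Speaker:

Zhang Duo (张铎)丨Chief Architect at Sensors Data, Apache HBase PMC Chair

Holds bachelor's and master's degrees from the Department of Computer Science and Technology at Tsinghua University and has long worked on developing and maintaining open source software. Since 2015, has served successively as Committer, PMC Member, and Chair of the Apache HBase project, and became a Member of the Apache Software Foundation in 2020. In 2018, ranked third by number of contributions among nearly 7,000 Apache Software Foundation committers worldwide. Formerly chaired the Xiaomi Open Source Committee, responsible for planning and advancing Xiaomi's overall open source efforts. Currently Chief Architect at Sensors Data.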

Talk: Deep dive into resource manageability in ozone storage

Time: August 18, 14:00 - 14:30

Abstract:

Organizations need to manage the resources allocated to and used by the different entities within them. In the context of Apache Ozone, the resources are storage space and namespace (the count of files, keys, and directories). Apache Ozone provides the capability to define and control resource usage by specifying quotas. Ozone manages resources differently from the Hadoop system.

This talk will present the resource management capabilities, their behavior with respect to Ozone features such as trash and snapshots, and a comparison with the Hadoop system.

Speaker:

Sumit Agrawal丨Cloudera Senior Staff Engineer

Sumit Agrawal works at Cloudera, where he contributes to the Apache Ozone distributed storage project and is also a committer. He has 16 years of experience in the IT industry and has worked across various domains, including data storage, cloud applications, and middleware.

Talk: Spark SQL Shuffle Join Improvement at eBay

Time: August 18, 14:30 - 15:00

Abstract:

Join is one of the most important and widely used operations in data warehousing.

The Join operator in Apache Spark is one of the most expensive operators, especially Shuffle Join.

In this presentation, we will introduce a series of Shuffle Join optimizations recently added at eBay.

Specifically,

1. Unwrap cast in join condition to use bucket join;

2. Enhance shuffle exchange reuse to reduce table scans;

3. Push down partial aggregation through Join.

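Speaker:

Wang Yuming (王玉明)丨Software Engineer at eBay, Apache Spark PMC

Software development engineer on eBay's SQL on Hadoop team, Apache Spark PMC Member and Committer, and recipient of the 2022 SIGMOD Systems Award. Has contributed to Apache Spark since Spark 1.5.0 and has become one of its most active code contributors. Focuses on SQL query performance optimization.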

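Talk: Running HDFS Clusters with Hundreds of Billions of Files at ByteDance

Time: August 18, 15:00 - 15:30

Abstract:

As big data technology matures, data volumes and usage complexity keep growing, and Apache HDFS faces new challenges. At ByteDance, HDFS serves as the storage for traditional Hadoop data warehouse workloads, the foundation for compute engines in a storage-compute disaggregated architecture, and the storage foundation for machine learning model training. Building on HDFS itself, ByteDance's big data storage team has built cross-region storage scheduling that serves large-scale compute resource scheduling and improves the stability of compute jobs. The team also provides data identification and hot/cold scheduling that unifies user-side caching, regular three-replica storage, and cold storage. This talk describes how ByteDance understands the new requirements that emerging scenarios place on traditional big data storage, and shares how its technology and operations systems have evolved to support different application scenarios.

Speaker:

Xiong Mu (熊睦)丨Infrastructure Engineer at ByteDance

Engineer on ByteDance's big data storage foundation team, mainly responsible for evolving the HDFS metadata service for big data storage and supporting the upper-layer compute ecosystem.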

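Talk: Apache Kyuubi & Celeborn (Incubating): Helping Spark Embrace Cloud Native

Time: August 18, 15:45 - 16:15

Abstract:

Over the past several years, NetEase has explored the cloud-native big data space extensively. This talk focuses on how to build an enterprise-grade, cloud-native Spark on Kubernetes platform for offline computing based on open source technologies such as Apache Kyuubi and Celeborn. It covers technology selection, architecture design, lessons learned, defect fixes and improvements, and cost reduction and efficiency gains, and takes an in-depth look at NetEase's results in this area.

Speaker:

Pan Cheng (潘成)丨Software Engineer at NetEase Shufan, Apache Kyuubi PMC, Apache Celeborn PPMC

Software engineer at NetEase Shufan, Apache Kyuubi PMC Member, and Apache Celeborn (Incubating) PPMC Member. Works mainly on enterprise-grade offline computing engine development and on building the Apache Kyuubi open source community.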

Talk: Resilient Data: Exploring Replication and Recovery in Apache Ozone

Time: August 18, 16:15 - 16:45

Abstract:

Data resilience is crucial in modern distributed systems to ensure data availability and durability. Apache Ozone, a scalable and distributed object store that has the capability to handle billions of objects, addresses the need for resilient data storage through its replication and recovery mechanisms.

This talk delves into the concepts and techniques employed by Apache Ozone to achieve high data resilience. The first part of the talk explores data replication in Apache Ozone. It discusses how Ozone maintains strong consistency by keeping consistent copies of blocks across all nodes and also briefly touches upon how one can reduce data redundancy using the Erasure coding feature. The second part, which is the crux of the talk, deals with data backup and recovery. It will discuss how one can use effective backup strategies like cross-cluster replication, Ozone snapshots, etc. This talk serves as a comprehensive guide for exploring the resilience aspects of Apache Ozone, enabling practitioners to leverage its capabilities effectively and make informed decisions when designing data-intensive applications.

Speaker:

Sadanand Shenoy丨Cloudera Software Engineer II

Sadanand Shenoy is a committer in the Apache Ozone project and has a keen interest in distributed systems. Sadanand currently works at Cloudera and has been actively contributing to the Apache Ozone project for the past 3 years. He holds a B.E. in Information Science and Engineering from MSRIT Bangalore.

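Talk: Linkis in Practice at Li Auto

Time: August 18, 16:45 - 17:15

Abstract:

Apache Linkis is a computation middleware layer built between upper-layer applications and underlying engines. This talk covers: why we chose Linkis as the middleware inside Li Auto; which features we added and fixed while putting Linkis into production, and how they help us better meet development needs and improve productivity; the challenges and problems we ran into in practice, along with the solutions and recommendations we adopted; and the new features and improvements we plan to add. We hope this talk offers useful experience to teams that are using or planning to use Linkis as middleware.

Speaker:

Xi Shihao (郗世豪)丨Senior Big Data Engineer at Li Auto

Senior big data engineer at Li Auto and Linkis Committer who led the development of Linkis 1.3.2. Has been with the company for 5 years and is currently responsible mainly for custom development on Linkis and Spark, working to roll out and promote the Linkis platform internally. By integrating with underlying engines such as Spark, works to explore more efficient and flexible data processing solutions and ultimately improve user productivity.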

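August 19, 13:30 - 17:15

Talk: Data Security: How Apache Ozone Secures Data Storage and Access

Time: August 19, 13:30 - 14:00

Abstract:

Apache Ozone is a new-generation distributed storage system under the Apache Software Foundation, with a clean architecture and good scalability. It supports both the S3 object protocol and the Hadoop file system, works with compute engines such as MR, Hive, Spark, and Impala, supports access from AWS clients, and offers a rich set of enterprise features. Data security is the cornerstone of a storage system. This talk mainly introduces Apache Ozone's data security features: storage safety, including storage reliability, replica disaster tolerance, data inspection, and data verification; and access security, including authentication, authorization, encryption, and logging. Together, these features help users build a secure and reliable big data storage system.

Speaker:

Chen Yi (陈怡)丨Principal Storage Engineer at Cloudera

Chair of the Apache Ozone PMC, with a long-standing focus on distributed storage. Currently works at Cloudera as Principal Storage Engineer. Previously worked at Tencent and Intel as technical lead for big data storage.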

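Talk: Smooth MapReduce -> Spark Migration in Practice at ByteDance

Time: August 19, 14:00 - 14:30

Abstract:

As the business has grown, ByteDance now runs about 1.2 million Spark jobs in production every day; by comparison, about 20,000 to 30,000 MapReduce jobs still run in production daily. MapReduce is a long-established batch processing framework, and from the perspective of big data platform developers, operating the MapReduce engine poses a series of problems: for example, the ROI of iterating on the framework is low, and it adapts poorly to new compute scheduling frameworks. From the users' perspective, the MapReduce engine has its own problems: compute performance is poor, extra pipeline tools are needed to manage jobs that run serially, and users who want to migrate to Spark are held back by the large number of existing jobs, many of which use scripts that Spark itself does not support. Against this background, ByteDance's Batch team designed and implemented a solution for smoothly migrating MapReduce jobs to Spark. With it, users only need to add a small number of parameters or environment variables to existing jobs to complete the migration from MapReduce to Spark, which greatly reduces migration cost and has delivered solid cost savings.

Speaker:

Wei Zhongjia (魏中佳)丨Infrastructure Engineer at ByteDance

Joined ByteDance in 2018 and is currently a big data development engineer on ByteDance's infrastructure team. Focuses on distributed big data computing and is mainly responsible for Spark core development and the development of ByteDance's in-house Shuffle Service.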

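Talk: Apache Kudu at Sensors Data: Applications and Practice

Time: August 19, 14:30 - 15:00

Abstract:

This talk covers the difficulties we encountered using Apache Kudu at Sensors Data, our solutions, and our future plans for Apache Kudu.

It focuses on the following three points:

1. Data migration for Apache Kudu;

2. Solving Apache Kudu's slow startup;

3. Solving Apache Kudu's metadata storage problem.

Speaker:

Wang Xixu (汪细勖)丨Distributed Software Development Engineer at Sensors Data (神策网络科技（北京）有限公司)

Graduated from Beihang University in 2017 and has long worked on infrastructure for internet big data, mainly developing and applying distributed storage and computing systems. A passionate open source contributor who actively takes part in community work, having contributed to Apache Doris, Apache Pegasus, and Apache Kudu, and an Apache Doris committer. Currently works in the storage group of the Infrastructure R&D department at Sensors Data.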

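Talk: HDFS Data Governance at Xiaomi: Practice and Evolution

Time: August 19, 15:00 - 15:30

Abstract:

HDFS is Xiaomi's underlying data storage system. As the company's business has grown rapidly, data volumes have soared and storage costs have risen quickly, making HDFS data governance unavoidable.

This talk focuses on the background of HDFS data governance at Xiaomi; how Xiaomi applies tiered storage for hot, warm, and cold data and uses more cost-effective public cloud object storage to carry out HDFS data governance, and how that practice has evolved; and plans for future data governance.

Speaker:

Wang Chengwei (王成伟)丨Senior Software Engineer at Xiaomi

Senior software development engineer at Xiaomi and HDFS Contributor with many years of experience optimizing and maintaining HDFS. At Xiaomi, mainly responsible for HDFS optimization and maintenance.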

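Talk: Apache Celeborn (Incubating): Making Spark and Flink Faster, More Stable, and More Elastic

Time: August 19, 15:45 - 16:15

Abstract:

Apache Celeborn (Incubating) is a high-performance, highly available, scalable general-purpose shuffle service that supports the two mainstream engines, Spark and Flink (with support for more engines such as Tez and MR planned). At Alibaba and a number of other well-known companies, Celeborn handles tens of petabytes of production shuffle data every day, improving stability and performance while reducing cost. This talk introduces Celeborn's core design for high performance and high availability, its unified architecture for multiple engines, user case studies, and how to get more involved in the community.

Speaker:

Zhou Keyong (周克勇)丨Head of the EMR Spark Engine at Alibaba Cloud

Head of the EMR Spark engine at Alibaba Cloud and original author of Apache Celeborn (Incubating), with experience in Remote Shuffle Service, vectorized engines, optimizers, and related areas.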

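Talk: Building Data Application Tools Quickly and Efficiently on Apache Linkis

Time: August 19, 16:15 - 16:45

Abstract:

This talk introduces Apache Linkis and the state of its community, and explains how Apache Linkis serves as a development foundation for data application tools, reducing the development effort that upper-layer tools spend on computation governance concerns such as connectivity, extensibility, control, and reuse. For example, a data quality tool only needs to focus on managing quality rules, without handling high task concurrency or multi-tenancy. We will also discuss which essential capabilities the foundation provides for data application tools.

Speaker:

Wang Heping (王和平)丨Senior Engineer at WeBank

Apache Linkis PMC Member, currently at WeBank, mainly responsible for the development and operation of projects such as Linkis, Spark, Trino, and DataSphereStudio.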

Talk: How increasing partition size in Apache Cassandra can reduce disk usage by over 30%

Time: August 19, 16:45 - 17:15

Abstract:

Did you know that over-partitioning in Apache Cassandra can lead to excessive storage requirements? In this presentation, we explore how we at Instaclustr were able to reduce the storage footprint of our metrics data by over 30%, from 244 TB to 157 TB, and improve the general performance of our cluster, simply by making a small change to the schema of the tables we were using. Instaclustr manages a fleet of over 10 000 customer servers as part of our managed service offering, and part of that system includes real-time metrics collection from the operating system and running applications, which are stored in a 70-node Apache Cassandra cluster. We will go into detail explaining what problems the existing schema was designed to solve, how our Cassandra experts determined what we needed to change, and why the change was able to drastically improve our storage efficiency without major changes to our downstream systems.
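The "over 30%" figure can be checked against the two footprint numbers quoted in the abstract. The following is only a minimal arithmetic sketch; the helper name `reduction_percent` is hypothetical and introduced here for illustration, and nothing in it reflects how Instaclustr measured its cluster.

```python
def reduction_percent(before_tb: float, after_tb: float) -> float:
    """Return the relative reduction from before_tb to after_tb, in percent."""
    return (before_tb - after_tb) / before_tb * 100.0


# Footprint figures quoted in the abstract: 244 TB before, 157 TB after.
saved_tb = 244 - 157
percent = reduction_percent(244, 157)
print(f"saved {saved_tb} TB, a {percent:.1f}% reduction")
# prints "saved 87 TB, a 35.7% reduction"
```

A saving of 87 TB out of 244 TB is roughly 35.7%, which is consistent with the "over 30%" in the talk title.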

Speaker:

John Del Castillo丨NetApp Technology Evangelist

John Del Castillo is a software engineer with over 15 years of experience developing enterprise software solutions across a variety of languages and technologies. For 6 years he worked at Instaclustr as a Lead Engineer, and for the last year he has taken the mantle of Technology Evangelist, specializing in open-source technology. In this role, he explores the landscape of open-source technologies, explores new solutions, documents interesting use cases and creates written and video content to help educate and encourage people to use open source for their business.

Track Agenda

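As the official global conference series of the Apache Software Foundation (ASF), CommunityOverCode Asia each year attracts participants and communities of all levels from around the world to explore "tomorrow's technology" together. At the upcoming CommunityOverCode Asia 2023, held August 18 to 20, you can get an up-close look at the latest developments and emerging innovations from Apache projects.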
